[New Scheduler] Add duration checker #4984
}
}

trait DurationCheckerProvider extends Spi {
It is based on the SPI.
actionMetaData.binding match {
  case Some(binding) =>
    client
The Some and None cases can call a helper function, since the only difference in the query is the List to match on; pass that List as a param.
updated accordingly
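The refactoring suggested above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the case classes, the field names ("binding", "name", "path"), and the `Query` type are all hypothetical stand-ins, and no real Elasticsearch client call is shown. The point is only that the `match` picks the List while a single helper builds the rest of the query.

```scala
// Hypothetical sketch: types and field names are assumptions for illustration only.
object DurationQuerySketch {
  final case class Binding(namespace: String, packageName: String)
  final case class ActionMetaData(namespace: String, name: String, binding: Option[Binding])

  // Stand-in for the real search request: the terms to match plus a time window.
  final case class Query(mustMatch: List[(String, String)], windowMinutes: Int)

  // Shared helper: everything except the match list is identical in both cases.
  private def averageDurationQuery(mustMatch: List[(String, String)], windowMinutes: Int): Query =
    Query(mustMatch, windowMinutes)

  def queryFor(action: ActionMetaData, windowMinutes: Int): Query = {
    // The only difference between the bound and unbound case is this List.
    val mustMatch = action.binding match {
      case Some(b) => List("binding" -> s"${b.namespace}/${b.packageName}", "name" -> action.name)
      case None    => List("path" -> s"${action.namespace}/${action.name}")
    }
    averageDurationQuery(mustMatch, windowMinutes)
  }
}
```

With this shape, a change to the shared part of the query only has to be made in one place.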
import scala.concurrent.Future

object NoopDurationCheckerProvider extends DurationCheckerProvider {
So this means you operate the scheduler without using the average activation duration heuristic for an action. How important is the heuristic for the performance of the scheduler?
This is just for other DBs such as CouchDB or CosmosDB, in case the scheduler is used with something other than ES.
(Even though using it with ES is highly recommended.)
The average duration is important for improving the accuracy of the calculation, but the queue can still work without it. When an action is newly created, there are no activations and therefore no average duration.
In such a case, it assumes one container can handle one activation in the given time.
So even if one container could actually handle multiple activations in that period, it assumes a container can handle only one, so schedulers tend to overprovision containers.
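The fallback described above might look like the following sketch. The function name, the `Option[Double]` representation of a missing average, and the flooring to whole activations are assumptions made for illustration; the PR's actual calculation may differ.

```scala
// Hypothetical sketch of the "no average duration yet" fallback.
object ContainerCapacitySketch {
  // How many activations one container is assumed to handle within `intervalMs`.
  def activationsPerContainer(averageDurationMs: Option[Double], intervalMs: Long): Int =
    averageDurationMs match {
      // Known average: divide the interval by it, but never drop below one.
      case Some(d) if d > 0 => math.max(1, (intervalMs / d).toInt)
      // No data (new action, or a no-op checker): assume one activation per container.
      case _ => 1
    }
}
```

Under these assumptions an action that really takes 20 ms is treated as if a container could serve only one activation per 100 ms instead of five, which is where the overprovisioning comes from.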
In the case of CouchDB or CosmosDB, there is no average activation duration calculation since it uses this, correct?
Yes.
If required, anyone can create one, as it is based on the SPI.
One thing I forgot to tell you: after this duration checker landed, we introduced one more optimization.
Initially, the average duration was always calculated by this module, but now (in our downstream) it is only used when a queue is newly created. After that, the queue uses the durations passed from containers.
As per POEM2, each container autonomously pulls an activation by sending a fetch-request. So we added one more field, lastDuration, to the fetch-request. The queue can then keep the most recent N durations in a circular queue and calculate the average duration without any external API call.
But when a new queue is created or an action is newly created, there is no data in the circular queue, and the duration checker is used in such cases.
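A rough sketch of that downstream optimization, under stated assumptions: the class and method names here are hypothetical, durations are taken to be milliseconds, and the duration checker is modelled as a lazily evaluated `Option[Double]` rather than the real asynchronous call.

```scala
// Hypothetical sketch: a fixed-size circular buffer of the last N reported durations.
object RecentDurationsSketch {
  final class RecentDurations(capacity: Int) {
    require(capacity > 0, "capacity must be positive")
    private val buffer = new Array[Long](capacity)
    private var count = 0 // number of valid entries, at most `capacity`
    private var next = 0  // slot that the next duration overwrites

    // Called with the lastDuration value carried by a container's fetch-request.
    def add(durationMs: Long): Unit = {
      buffer(next) = durationMs
      next = (next + 1) % capacity
      if (count < capacity) count += 1
    }

    // None until at least one duration has been reported.
    def average: Option[Double] =
      if (count == 0) None else Some(buffer.take(count).sum.toDouble / count)
  }

  // Prefer in-memory data; consult the duration checker only when the buffer is empty.
  def averageDuration(recent: RecentDurations, fromChecker: => Option[Double]): Option[Double] =
    recent.average.orElse(fromChecker)
}
```

Because `fromChecker` is a by-name parameter, the external lookup is not evaluated at all once the buffer holds data.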
Ah okay, cool. So if I understand you correctly, once the invoker is running we still get to use the average activation duration heuristic, since we track it in memory. The Elasticsearch SPI is just for startup. That's good to know. The optimization you described sounds more important; we'll still get the benefits of tracking activation duration.
LGTM
851d440 to 8e698ae
Codecov Report
@@             Coverage Diff             @@
##           master    #4984       +/-   ##
===========================================
+ Coverage   29.09%   75.89%   +46.80%
===========================================
  Files         195      206       +11
  Lines        9553    10122      +569
  Branches      413      450       +37
===========================================
+ Hits         2779     7682     +4903
+ Misses       6774     2440     -4334
Continue to review full report at Codecov.
8e698ae to 80fef2f
It seems the build sometimes fails even though all tests passed, because it failed to upload the file.
Is this to upload the log files?
It keeps failing.
It seems this endpoint no longer accepts any files.
Could this be because the disk space is fully used? @dgrove-oss @rabbah
We failed to upload the logs to Box. This happens because we don't have a way to automatically remove old logs and the Box folder fills up. However, the upload to Box is optional (will not cause the travisci job to fail). There is a real test failure earlier in the log: https://travis-ci.org/github/apache/openwhisk/jobs/744844186#L7457
I will change this upload to use a different object store - will give it a try soon.
On Sun, Nov 22, 2020 at 3:32 PM David Grove wrote:
We failed to upload the logs to Box. This happens because we don't have a
way to automatically remove old logs and the Box folder fills up. However,
the upload to Box is optional (will not cause the travisci job to fail).
There is a real test failure earlier in the log:
https://travis-ci.org/github/apache/openwhisk/jobs/744844186#L7457
a4ca2c0 to 0032850
0032850 to 307527e
Finally, it has passed all test cases :)
- The scheduler PR apache/openwhisk#4984 introduced changes that required adaption in the setup of the openwhisk environment used for the automated tests.
Description
This is a follow-up to #4983; once #4983 is merged, I will rebase this again.
Major changes are:
This adds a duration checker for ElasticSearch.
With the new scheduler, it is important to decide when to add containers and how many to add.
The scheduler will calculate the average duration for the recent N activations and compute the processing power of one container, e.g. how many activations can be handled by one container in a given time. Factoring in the average duration, the number of incoming activations, and the number of activations in a queue, the scheduler can add more containers to handle the given activations.
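As a worked illustration of the calculation described above, here is a minimal sketch. The formula (ceiling of pending work over per-container capacity, minus the containers already running) and all parameter names are assumptions chosen to make the idea concrete; they are not taken from the scheduler's source.

```scala
// Hypothetical sketch of turning an average duration into a container count.
object ContainerDemandSketch {
  def containersToAdd(averageDurationMs: Double,
                      intervalMs: Long,
                      incoming: Int,
                      queued: Int,
                      existingContainers: Int): Int = {
    require(averageDurationMs > 0 && intervalMs > 0, "durations must be positive")
    // Processing power of one container: activations it can finish per interval.
    val perContainer = math.max(1, (intervalMs / averageDurationMs).toInt)
    // Work to be served in this interval: new arrivals plus the backlog.
    val pending = incoming + queued
    // Ceiling division: containers required to serve all pending activations.
    val needed = (pending + perContainer - 1) / perContainer
    // Only add containers; never report a negative number.
    math.max(0, needed - existingContainers)
  }
}
```

For example, with a 20 ms average duration and a 100 ms interval, one container is assumed to handle five activations, so 12 incoming plus 4 queued activations need four containers; with one already running, three more would be added.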
Related issue and scope
My changes affect the following components
Types of changes
Checklist: